We claim: 



1. In a data processing system wherein descriptor vectors associated with a plurality
of regions of molecules are stored in a database, a method for generating and storing data
characterizing at least one region of said plurality of regions, the method comprising the steps
of:

generating an entry comprising i) an identifier that identifies said at least one region, and 
ii) data characterizing a set of axes derived from property distribution of said at least one region; 

applying a mapping to the descriptor vector associated with said at least one region;

generating a key that corresponds to said mapping of the descriptor vector associated with
said at least one region; and 



storing said entry in a memory, wherein said key is associated with said entry. 
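
The steps of claim 1 can be sketched as follows. This is a minimal illustration only, not the claimed implementation: the linear projection used as the mapping, the grid quantization used to form the key (the `cell` parameter), and the names `project_descriptor`, `make_key`, and `store_region` are all assumptions introduced for this example.

```python
import math
from collections import defaultdict

def project_descriptor(descriptor, basis):
    """Map a descriptor vector by projecting it onto each basis vector (assumed mapping)."""
    return [sum(d * b for d, b in zip(descriptor, vec)) for vec in basis]

def make_key(mapped, cell=0.5):
    """Quantize mapped coordinates onto a grid so nearby vectors share a key (assumption)."""
    return tuple(math.floor(x / cell) for x in mapped)

def store_region(memory, region_id, axes, descriptor, basis, cell=0.5):
    """Generate an entry, derive its key from the mapped descriptor, and store the entry."""
    entry = {"region_id": region_id, "axes": axes}
    key = make_key(project_descriptor(descriptor, basis), cell)
    memory[key].append(entry)
    return key

# Hypothetical data: one region with identity axes and a 4-component descriptor.
memory = defaultdict(list)
identity_axes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
key = store_region(memory, "region-A", identity_axes, [0.7, 1.2, 3.0, 4.0], basis)
```

Keeping a list of entries per key lets several regions whose mapped descriptors fall in the same cell share one key.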



2. The method of claim 1, wherein said set of axes are invariant to rotation and
translation of said at least one region.



3. The method of claim 2, wherein said set of axes are derived from principal axes of
said property distribution.
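
One way to obtain principal axes from a property distribution is to diagonalize its weighted second-moment matrix about the weighted centroid. The sketch below assumes the distribution is sampled as weighted points and uses a plain cyclic Jacobi iteration. The names `second_moment`, `jacobi_eigen`, and `principal_axes` are illustrative and do not come from the claims.

```python
import math

def second_moment(points, weights):
    """Weighted centroid and 3x3 second-moment matrix of sampled property values."""
    total = sum(weights)
    center = [sum(w * p[i] for p, w in zip(points, weights)) / total for i in range(3)]
    m = [[0.0] * 3 for _ in range(3)]
    for p, w in zip(points, weights):
        d = [p[i] - center[i] for i in range(3)]
        for i in range(3):
            for j in range(3):
                m[i][j] += w * d[i] * d[j] / total
    return center, m

def jacobi_eigen(a, sweeps=50, tol=1e-20):
    """Eigenvalues and eigenvectors (columns of v) of a symmetric matrix."""
    n = len(a)
    a = [row[:] for row in a]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        if sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j) < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for mat in (a, v):  # rotate columns p and q of a and of v
                    for k in range(n):
                        mp, mq = mat[k][p], mat[k][q]
                        mat[k][p], mat[k][q] = c * mp - s * mq, s * mp + c * mq
                for k in range(n):  # rotate rows p and q of a
                    ap, aq = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * ap - s * aq, s * ap + c * aq
    return [a[i][i] for i in range(n)], v

def principal_axes(points, weights):
    """Centroid plus unit axes, sorted by decreasing eigenvalue."""
    center, m = second_moment(points, weights)
    values, v = jacobi_eigen(m)
    order = sorted(range(3), key=lambda i: -values[i])
    return center, [[v[k][i] for k in range(3)] for i in order]

# Hypothetical samples: elongated along the (1, 1, 0) direction.
center, axes = principal_axes(
    [(1, 1, 0), (-1, -1, 0), (0, 0, 0.5), (0, 0, -0.5)], [1, 1, 1, 1]
)
```

Because the axes are taken relative to the centroid and follow the distribution as it rotates, they give a frame that moves with the region. Eigenvector signs remain ambiguous and would need a separate convention.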



4. The method of claim 3, wherein said property distribution of said at least one
region is based upon application of a smearing function to a property field.
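
A smearing function spreads each point value of a property field over its neighborhood. The sketch below assumes an isotropic Gaussian kernel of width `sigma`. The claim does not fix the form of the smearing function, so the kernel and the name `smeared_property` are illustrative choices.

```python
import math

def smeared_property(point, sites, values, sigma=1.0):
    """Sum the Gaussian-smeared contribution of each site's property value at `point`."""
    total = 0.0
    for site, value in zip(sites, values):
        dist2 = sum((a - b) ** 2 for a, b in zip(point, site))
        total += value * math.exp(-dist2 / (2.0 * sigma ** 2))
    return total

# Hypothetical property field: two sites with opposite-signed values.
sites = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
values = [2.0, -1.0]
at_first_site = smeared_property((0.0, 0.0, 0.0), sites, values)
```

Sampling the smeared field on a set of points yields weighted samples from which a property distribution, and then its axes, can be computed.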

5. The method of claim 1, wherein said plurality of descriptor vectors are classified
into groups, wherein said mapping step maps said descriptor vector to a space that optimally
discriminates between said groups of descriptor vectors.

6. The method of claim 5, wherein said mapping is derived from the steps of:

generating first data representing differences between said groups of descriptor vectors; 

generating second data representing variations within said groups of descriptor vectors; 

identifying a set of component vectors that maximizes an F distributed criterion function, 
said criterion function having a numerator based upon said first data and a denominator based 
upon said second data; 

generating an F distributed statistic for subsets of said component vectors, said statistic 
having a numerator based upon said first data and a denominator based upon said second data; 

for each particular subset of component vectors, calculating a probability value for the 
F-distributed statistic associated with the particular subset; 

selecting a probability value from probability values for said subsets of component 
vectors based upon a predetermined criterion; 

identifying the subset of said component vectors associated with the selected probability 
value; and 



generating a mapping to a space corresponding to the subset of component vectors
associated with the selected probability value, and storing the mapping for subsequent
processing.



7. The method of claim 6, wherein said first data comprises a matrix $\Sigma_b$
representing covariance between said groups of descriptor vectors, and said second data
comprises a matrix $\Sigma_w$ representing covariance within said groups of descriptor vectors.
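
The between-group and within-group matrices can be illustrated with a short sketch. It assumes the unnormalized scatter-matrix convention, which is a choice made here; the claim only states that the matrices represent covariance between and within the groups. The helper names are hypothetical.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def add_outer(matrix, d, scale=1.0):
    """Accumulate scale * d d^T into matrix in place."""
    for i in range(len(d)):
        for j in range(len(d)):
            matrix[i][j] += scale * d[i] * d[j]

def scatter_matrices(groups):
    """Return (between-group, within-group) scatter matrices for grouped vectors."""
    dim = len(groups[0][0])
    overall = mean([v for g in groups for v in g])
    between = [[0.0] * dim for _ in range(dim)]
    within = [[0.0] * dim for _ in range(dim)]
    for group in groups:
        center = mean(group)
        # Between-group term: group mean relative to the overall mean, weighted by size.
        add_outer(between, [c - o for c, o in zip(center, overall)], scale=len(group))
        # Within-group term: each vector relative to its own group mean.
        for v in group:
            add_outer(within, [x - c for x, c in zip(v, center)])
    return between, within

# Hypothetical data: two groups separated along the second coordinate only.
groups = [[[0.0, 0.0], [2.0, 0.0]], [[0.0, 4.0], [2.0, 4.0]]]
sigma_b, sigma_w = scatter_matrices(groups)
```

In this toy data all between-group variation lies along the second coordinate and all within-group variation along the first, so a direction along the second coordinate separates the groups best.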

8. The method of claim 7, wherein said criterion function has the general form:

$$f(w) = C\left(\frac{w^{T}\Sigma_b w}{w^{T}\Sigma_w w}\right)$$

where w is some vector, and C is a constant based upon degrees of freedom in $\Sigma_b$ and $\Sigma_w$.

9. The method of claim 8, wherein C is determined as follows:

$$C = \frac{1/(\text{degrees of freedom in } \Sigma_b)}{1/(\text{degrees of freedom in } \Sigma_w)} = \frac{1/(N-1)}{1/(\sum n_i - N)}$$

where $N$ represents the number of groups of descriptor vectors, $n_i$ represents the number of
regions, and $\sum$ represents the sum of $n_i$ for the $N$ groups.
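
As a small worked example, reading the constant of claim 9 as the ratio of 1/(N - 1) to 1/(sum of n_i - N), it can be computed from the group sizes alone. The function name `criterion_constant` and the sample group sizes are assumptions made for illustration.

```python
def criterion_constant(group_sizes):
    """Ratio of inverse degrees of freedom: (1 / (N - 1)) / (1 / (sum(n_i) - N))."""
    n_groups = len(group_sizes)
    total = sum(group_sizes)
    if n_groups < 2 or total <= n_groups:
        raise ValueError("need at least two groups and more regions than groups")
    return (1.0 / (n_groups - 1)) / (1.0 / (total - n_groups))

# Hypothetical sizes: three groups holding 10, 12, and 8 regions.
c = criterion_constant([10, 12, 8])
```

For these sizes N = 3 and the sum of n_i is 30, so the constant equals (30 - 3)/(3 - 1) = 27/2.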

10. The method of claim 7, wherein the step of identifying a set of component vectors 
that maximizes an F distributed criterion function comprises the substeps of: 

determining a set of (eigenvalue, eigenvector) pairs for the matrix $\Sigma_w$; and

determining said set of component vectors based upon said set of (eigenvalue,
eigenvector) pairs for the matrix $\Sigma_w$.

11. The method of claim 10, wherein said statistic for a given subset of component
vectors is based upon value of said criterion function for said subset of component vectors. 

12. The method of claim 11, wherein said statistic for a given subset of component 
vectors has the following form: 



where $f_k$ represents the value of the criterion function at a
component vector in the given subset,

C is a constant,

$L_s$ represents the number of $f_k$ values in the given subset
of component vectors, and

the $\sum$ operation sums over the $L_s$ $f_k$ values in the given
subset of component vectors.



13. The method of claim 12, wherein said probability value for a particular
F-distributed statistic represents a probability value that the particular F-distributed statistic could
have been larger by chance.

14. The method of claim 13, wherein said probability value selected from probability 
values for said subsets of component vectors is a minimum probability value of said probability 
values for said subsets of component vectors. 
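
The selection rule of claim 14 amounts to taking the minimum over the per-subset probability values. The sketch below assumes those values have already been computed; evaluating them from the F distribution is not shown. The dictionary layout and the name `select_subset` are hypothetical.

```python
def select_subset(p_values):
    """Return the (subset, p_value) pair with the smallest probability value."""
    if not p_values:
        raise ValueError("no subsets to choose from")
    return min(p_values.items(), key=lambda item: item[1])

# Hypothetical probability values, keyed by the indices of the component vectors used.
p_values = {(0,): 0.04, (0, 1): 0.003, (0, 1, 2): 0.02}
subset, p = select_subset(p_values)
```

A smaller probability value means the observed statistic is less likely to have arisen by chance, so the chosen subset is the one whose group separation is most significant.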

15. The method of claim 6,

wherein said mapping for said at least one descriptor vector performs a loop over each 
component vector belonging to the subset of component vectors associated with the selected 
probability; 

wherein, in each iteration of said loop, dot product of said descriptor vector with a 
transpose of a unit vector for the given component vector is added to a running sum. 
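
Read literally, the loop of claim 15 can be sketched as below. The normalization of each component vector to unit length and the name `map_by_running_sum` are assumptions of this example. An implementation that keeps one coordinate per component vector, rather than a single sum, would differ only in how the per-iteration dot products are collected.

```python
import math

def map_by_running_sum(descriptor, component_vectors):
    """Accumulate the dot product of the descriptor with each unit component vector."""
    running_sum = 0.0
    for component in component_vectors:
        norm = math.sqrt(sum(c * c for c in component))
        unit = [c / norm for c in component]
        running_sum += sum(d * u for d, u in zip(descriptor, unit))
    return running_sum

# Hypothetical descriptor and two axis-aligned component vectors of arbitrary length.
mapped = map_by_running_sum([1.0, 2.0, 3.0], [[2.0, 0.0, 0.0], [0.0, 0.0, 5.0]])
```

Here the two unit vectors pick out the first and third descriptor components, so the running sum is their total.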



16. In a data processing system wherein descriptor vectors associated with a plurality
of regions of molecules are stored in a database, CHARACTERIZED IN THAT said data
processing system includes a memory storing a plurality of entries each comprising i) an
identifier that identifies at least one region, and ii) data characterizing a set of axes derived from
property distribution of said at least one region, a method for determining alignment of similar
molecular structure comprising the steps of:

providing a descriptor vector associated with said query molecular region;

mapping said descriptor vector associated with said query molecular region;

generating a second key that corresponds to said mapping of said descriptor vector
associated with said query molecular region; and

retrieving from said memory entries that are associated with a first key that corresponds
to said second key; and

for at least one entry retrieved from said memory,

generating data that represents a match hypothesis associated with said query
molecular region and at least one region R identified by said at least one entry retrieved from said
memory, wherein said data is based upon parameters of a transformation that aligns a set of axes
derived from property distribution of said query molecular region with a set of axes derived from
property distribution of said at least one region R;

determining a score associated [...]

storing said data and score as an entry in a vote table.
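
The transformation underlying a match hypothesis can be sketched as a rigid motion that carries the query frame onto a stored region frame. The sketch assumes each set of axes is an orthonormal triple stored as matrix rows together with an origin point. The origin, the placeholder score of 1.0, and all function names are assumptions made for illustration, since the scoring step is not legible in the text above.

```python
def matmul(a, b):
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def transpose(a):
    """Transpose of a 3x3 matrix."""
    return [[a[j][i] for j in range(3)] for i in range(3)]

def apply(rot, vec):
    """Apply a 3x3 matrix to a 3-vector."""
    return [sum(rot[i][k] * vec[k] for k in range(3)) for i in range(3)]

def match_hypothesis(query_axes, query_origin, region_axes, region_origin):
    """Rotation and translation taking the query frame onto the region frame."""
    # With axes stored as rows, region_axes^T @ query_axes maps each query axis
    # onto the corresponding region axis.
    rotation = matmul(transpose(region_axes), query_axes)
    moved = apply(rotation, query_origin)
    translation = [r - m for r, m in zip(region_origin, moved)]
    return rotation, translation

# Hypothetical frames: the region frame is the query frame turned 90 degrees about z.
query_axes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
region_axes = [[0.0, 1.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
rotation, translation = match_hypothesis(
    query_axes, [1.0, 0.0, 0.0], region_axes, [0.0, 0.0, 0.0]
)
# Placeholder score; how the score is determined is not specified here.
vote_table = [{"region_id": "R1", "rotation": rotation,
               "translation": translation, "score": 1.0}]
```

Storing the rotation and translation with each vote makes it possible to group hypotheses that agree on the same alignment.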

17. The method of claim 16, further comprising the steps of:

selecting one or more entries of said vote table based upon said score associated
with said entries; and

identifying at least one region that corresponds to the selected entries of said vote
table as potential matching regions to said query molecular region.
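
Selecting entries from the vote table by score can be sketched as a filter followed by a sort. The threshold rule, the dictionary layout of the table, and the name `select_matches` are assumptions of this example; the claim only requires that selection be based upon the score.

```python
def select_matches(vote_table, min_score):
    """Return region identifiers of entries scoring at least min_score, best first."""
    chosen = [entry for entry in vote_table if entry["score"] >= min_score]
    chosen.sort(key=lambda entry: entry["score"], reverse=True)
    return [entry["region_id"] for entry in chosen]

# Hypothetical vote table with three scored match hypotheses.
vote_table = [
    {"region_id": "R1", "score": 0.91},
    {"region_id": "R2", "score": 0.35},
    {"region_id": "R3", "score": 0.78},
]
matches = select_matches(vote_table, min_score=0.5)
```

The regions returned are the potential matching regions for the query molecular region.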

18. The method of claim 16, wherein said set of axes derived from property
distribution of a region are invariant to rotation and translation of said region.

19. The method of claim 18, wherein said set of axes derived from property
distribution of a region are derived from principal axes of said property distribution.

20. The method of claim 19, wherein said property distribution of said region is based
upon application of a smearing function to a property field.

21. The method of claim 16, wherein said plurality of descriptor vectors are classified
into groups, and wherein said mapping step maps said descriptor vector to a space that optimally
discriminates between said groups of descriptor vectors.

22. The method of claim 21, wherein said mapping is derived from the steps of:

generating first data representing differences between said groups of descriptor vectors;

generating second data representing variations within said groups of descriptor vectors;



identifying a set of component vectors that maximizes an F distributed criterion function,
said criterion function having a numerator based upon said first data and a denominator based
upon said second data;



generating an F distributed statistic for subsets of said component vectors, said statistic 
having a numerator based upon said first data and a denominator based upon said second data; 



for each particular subset of component vectors, calculating a probability value for the 
F-distributed statistic associated with the particular subset; 

selecting a probability value from probability values for said subsets of component
vectors based upon a predetermined criterion;

identifying the subset of said component vectors associated with the selected probability
value; and

generating a mapping to a space corresponding to the subset of component vectors
associated with the selected probability value, and storing the mapping for subsequent
processing.

23. The method of claim 22, wherein said first data comprises a matrix $\Sigma_b$
representing covariance between said groups of descriptor vectors, and said second data
comprises a matrix $\Sigma_w$ representing covariance within said groups of descriptor vectors.

24. The method of claim 23, wherein said criterion function has the general form:

$$f(w) = C\left(\frac{w^{T}\Sigma_b w}{w^{T}\Sigma_w w}\right)$$

where w is some vector, and C is a constant based upon degrees of freedom in $\Sigma_b$ and $\Sigma_w$.

25. The method of claim 24, wherein C is determined as follows:

$$C = \frac{1/(\text{degrees of freedom in } \Sigma_b)}{1/(\text{degrees of freedom in } \Sigma_w)} = \frac{1/(N-1)}{1/(\sum n_i - N)}$$

where $N$ represents the number of groups of descriptor vectors, $n_i$ represents the number of
regions, and $\sum$ represents the sum of $n_i$ for the $N$ groups.

26. The method of claim 23, wherein the step of identifying a set of component
vectors that maximizes an F distributed criterion function comprises the substeps of:

determining a set of (eigenvalue, eigenvector) pairs for the matrix $\Sigma_w$; and

determining said set of component vectors based upon said set of (eigenvalue,
eigenvector) pairs for the matrix $\Sigma_w$.

27. The method of claim 26, wherein said statistic for a given subset of component
vectors is based upon value of said criterion function for said subset of component vectors. 

28. The method of claim 27, wherein said statistic for a given subset of component
vectors has the following form:



where $f_k$ represents the value of the criterion function at a
component vector in the given subset,

C is a constant,

$L_s$ represents the number of $f_k$ values in the given subset
of component vectors, and

the $\sum$ operation sums over the $L_s$ $f_k$ values in the given
subset of component vectors.



28. The method of claim 22, wherein said probability value for a particular
F-distributed statistic represents a probability value that the particular F-distributed statistic could 
have been larger by chance. 

29. The method of claim 28, wherein said probability value selected from probability
values for said subsets of component vectors is a minimum probability value of said probability
values for said subsets of component vectors.



30. The method of claim 22, 

wherein said mapping for said at least one descriptor vector performs a loop over each
component vector belonging to the subset of component vectors associated with the selected
probability;

wherein, in each iteration of said loop, dot product of said descriptor vector with a
transpose of a unit vector for the given component vector is added to a running sum.